1 Articulatory dynamics of /s/-retraction in Glasgow

This script is a supplement to Thielking (2019) and extends that work on /s/-retraction in Glasgow by adding dynamic articulatory measurements of the lips from lip-profile video and of the tongue from ultrasound imaging.

2 Methods

2.1 Lip data analysis

Dynamic measures of lip protrusion were generated using the following procedure. While lip protrusion in Thielking (2019) was based on manual annotation of the lips at a single time point, this work, inspired by King and Ferragne (2020a), uses deep learning to automatically segment the lips in the relevant video frames across the entire duration of the sibilants.

King and Ferragne (2020a) showed that the lips in /r/ and /w/ can be successfully segmented automatically with the help of Convolutional Neural Networks (CNNs) and transfer learning from the field of image recognition. Similarly, the present work implemented a deep learning approach on grayscale profile-view images of the lips for the sibilants /s/ and /∫/ in various contexts. Instead of training a CNN from scratch, which requires large amounts of training data, transfer learning makes use of pre-trained models from a source domain by adapting them to a new use case. In the current study, transfer learning was implemented with the help of the PixelLib Python package. The Mask R-CNN model was trained for 50 epochs with a batch size of 4 using the ResNet-101 backbone on Google Colab. Inference on images of the lips produced an output image with the segmented lips and a corresponding binary image of the segmentation mask.

The training and validation data sets consist of 200 manually annotated images of the lip area. To improve segmentation accuracy, the images in this dataset were sampled from all speakers, from different points in the sibilant duration, and from various sibilant contexts. In contrast to King and Ferragne (2020a), segmentation was not limited to representative midpoint frames of the sibilants; automatic segmentation was extended to all frames across the sibilant duration, resulting in the segmentation of ~14000 images.

After training, the model achieved a mean average precision (mAP) of 0.89.

Workflow

  1. Manual annotation of training and validation data (170/30 images)
  2. Training of model
    • 50 epochs
    • batch size 4
  3. Automatic prediction and segmentation of all relevant sibilant lip video frames (~14000 images)
  4. Extraction of segmentation masks using custom python script
  5. Automatic calculation of lip protrusion following Lawson, Stuart-Smith, and Rodger (2019)
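
Steps 2–3 of the workflow above can be sketched with PixelLib roughly as follows. The dataset and model paths (`lip_dataset/`, `mask_rcnn_models/`, `mask_rcnn_coco.h5`) are hypothetical, and the imports are deferred into the functions so the sketch loads without PixelLib installed; this is an illustration of the training/inference API, not the exact script used.

```python
def train_lip_segmenter(dataset_dir="lip_dataset", out_dir="mask_rcnn_models"):
    """Step 2: fine-tune Mask R-CNN (ResNet-101 backbone) on the annotated lips."""
    from pixellib.custom_train import instance_custom_training  # deferred import
    trainer = instance_custom_training()
    trainer.modelConfig(network_backbone="resnet101", num_classes=1, batch_size=4)
    trainer.load_pretrained_model("mask_rcnn_coco.h5")  # transfer-learning checkpoint
    trainer.load_dataset(dataset_dir)  # expects train/ and test/ annotation splits
    trainer.train_model(num_epochs=50, path_trained_models=out_dir)

def segment_lip_frame(model_path, image_path):
    """Step 3: predict on one lip video frame and return the binary mask(s)."""
    from pixellib.instance import custom_segmentation  # deferred import
    seg = custom_segmentation()
    seg.inferConfig(num_classes=1, class_names=["BG", "lips"])
    seg.load_model(model_path)
    segmask, output = seg.segmentImage(image_path, show_bboxes=False)
    return segmask["masks"]  # boolean array, one channel per detected instance
```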

After all ~14000 lip video frames had been automatically segmented, the segmentation masks were extracted. Lip protrusion was then also measured automatically with the help of the image processing library OpenCV, following the procedure laid out in Lawson, Stuart-Smith, and Rodger (2019) and King and Ferragne (2020b) with only minor changes. First, a horizontal fiducial line was placed so that it intersected the participant’s corner of the mouth (see Figure 2.1). A second, vertical fiducial line was positioned touching the edges of both the upper and lower lips. Lip length was then measured in pixels along the horizontal fiducial from the intersection of the two fiducial lines. While the horizontal fiducial line was kept constant across all recordings, the vertical fiducial was adapted to the lip positioning, resulting in an increase or decrease of lip length along the horizontal fiducial. The edges of the upper and lower lips were identified automatically from the segmentation masks using OpenCV’s findContours(), convexityDefects() and convexHull() functions.
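
A simplified numpy sketch of the fiducial measurement (the actual pipeline uses OpenCV's contour functions; the row index and mask layout here are assumptions for illustration):

```python
import numpy as np

def lip_length(mask: np.ndarray, fiducial_row: int) -> int:
    """Length of the lips (in pixels) along the horizontal fiducial line.

    `mask` is a binary segmentation mask (lips == 1). The vertical fiducial
    is taken to touch the leftmost lip pixel on the fiducial row; lip length
    is the horizontal extent of lip pixels from that point.
    """
    cols = np.flatnonzero(mask[fiducial_row] > 0)
    return int(cols[-1] - cols[0] + 1) if cols.size else 0
```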

Lip protrusion measurements were z-scored for inter-speaker comparability.
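
A minimal sketch of the per-speaker z-scoring, assuming a frame-level data frame with hypothetical column names `speaker` and `protrusion`:

```python
import pandas as pd

def zscore_by_speaker(df: pd.DataFrame, col="protrusion", by="speaker") -> pd.Series:
    """Z-score a measurement within each speaker for cross-speaker comparability."""
    return df.groupby(by)[col].transform(lambda x: (x - x.mean()) / x.std())
```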


Figure 2.1: Automatic segmentation of the mouth (in blue) via semantic segmentation using a CNN (left). Extraction of relevant lip landmarks and plotting of fiducials in segmentation mask (right).

The video above shows the word street produced by speaker F01, together with the corresponding automatic segmentation of the lips and the measured lip protrusion per frame. Note: playback speed has been reduced to 50%.

Table 2.1 shows the mean number of video frames captured along the duration of the sibilant for each speaker.

Table 2.1: Mean number of video frames by speaker across sibilant

Speaker   Mean frames   SD frames
F01       5.34          1.03
F02       6.02          1.39
F03       4.90          1.48
F04       4.70          1.40
F05       3.27          0.88
F06       6.45          1.26
F07       6.01          1.04
F08       3.55          1.11
F09       3.56          1.20
M01       3.93          0.96
M02       6.05          0.98
M03       4.13          0.82
M04       5.33          1.02
M05       4.24          1.29
M06       4.55          1.23
M07       4.62          1.74
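
The per-speaker summary in Table 2.1 can be reproduced with a pandas groupby, assuming a frame-level data frame with hypothetical columns `speaker` and `token` (the sibilant token each video frame belongs to):

```python
import pandas as pd

def frame_count_summary(frames: pd.DataFrame) -> pd.DataFrame:
    """Mean and SD of the number of video frames per sibilant token, by speaker."""
    n = frames.groupby(["speaker", "token"]).size().rename("n_frames").reset_index()
    return (n.groupby("speaker")["n_frames"]
              .agg(Mean_Frame_No="mean", SD_Frame_No="std")
              .reset_index())
```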

To verify the automatically segmented lip measurements, a Pearson correlation test comparing them to the manually annotated measurements in Thielking (2019) was conducted: \(r=0.41, p(one-tailed)<0.001\). This relatively low correlation is at least partly expected, since Thielking (2019) annotated the maximum protrusion in the central portion of the sibilant, while the automatic method measured lip protrusion at the sibilant midpoint.
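
A sketch of the one-tailed test via scipy, halving the two-tailed p-value when r is positive (variable names hypothetical):

```python
from scipy import stats

def pearson_one_tailed(auto, manual):
    """Pearson r between automatic and manual measurements, one-tailed (H1: r > 0)."""
    r, p_two = stats.pearsonr(auto, manual)
    p_one = p_two / 2 if r > 0 else 1 - p_two / 2
    return r, p_one
```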

Plotting these midpoint protrusion measures, however, shows a pattern similar to that in Thielking (2019). As can be seen in Figure 2.2, /∫/ displays the largest amount of lip protrusion of the four contexts under investigation. Furthermore, the two clusters /str/ and /stj/ show more protrusion than pre-vocalic /s/.

As expected, taking into account the following vowel (Fig. 2.3) shows that sibilants followed by rounded vowels display more lip protrusion compared to unrounded vowels. This effect seems to be largest for /sV/. Note that /stj/ only occurs in rounded vowel contexts.


Figure 2.2: Z-scored lip protrusion by context across all speakers.


Figure 2.3: Midpoint lip protrusion for various contexts and following vowel

Running a linear mixed effects model reveals statistically significant differences between /sV/ on the one hand and /stj/ and /∫/ on the other. There is no significant difference between /sV/ and /str/.

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: z_lp ~ context * vowel + (1 | word) + (1 | speaker)
##    Data: lips
## 
## REML criterion at convergence: 12967.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -9.1188 -0.5209  0.0506  0.5847 15.8657 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  word     (Intercept) 0.01506  0.1227  
##  speaker  (Intercept) 0.01084  0.1041  
##  Residual             0.70208  0.8379  
## Number of obs: 5187, groups:  word, 22; speaker, 16
## 
## Fixed effects:
##                           Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)                0.08154    0.08138 16.93584   1.002 0.330461    
## contextstr                 0.17691    0.12330 14.51062   1.435 0.172561    
## contextstj                 0.43704    0.10282 14.31909   4.250 0.000770 ***
## contextsh                  0.90921    0.10867 13.67888   8.367 9.54e-07 ***
## vowelunrounded            -1.14858    0.10900 13.84618 -10.538 5.40e-08 ***
## contextstr:vowelunrounded  0.83104    0.16046 14.39894   5.179 0.000128 ***
## contextsh:vowelunrounded   0.55639    0.15391 13.76184   3.615 0.002886 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) cntxtstr cntxtstj cntxtsh vwlnrn cntxtst:
## contextstr  -0.592                                          
## contextstj  -0.710  0.469                                   
## contextsh   -0.672  0.443    0.532                          
## vowelunrndd -0.670  0.442    0.530    0.502                 
## cntxtstr:vw  0.455 -0.768   -0.360   -0.341  -0.679         
## cntxtsh:vwl  0.474 -0.313   -0.375   -0.706  -0.708  0.481  
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient

Tukey-corrected pairwise comparisons reveal significant differences between unrounded and rounded following vowels, in that rounded vowels show more protrusion than unrounded vowels in /sV/ and /∫/. However, there is no significant effect for /str/.

## $`emmeans of context, vowel`
##  context vowel      emmean     SE  df asymp.LCL asymp.UCL
##  sV      rounded    0.0815 0.0814 Inf    -0.078    0.2410
##  str     rounded    0.2584 0.0997 Inf     0.063    0.4539
##  stj     rounded    0.5186 0.0729 Inf     0.376    0.6615
##  sh      rounded    0.9907 0.0809 Inf     0.832    1.1494
##  sV      unrounded -1.0670 0.0814 Inf    -1.227   -0.9075
##  str     unrounded -0.0591 0.0727 Inf    -0.202    0.0834
##  stj     unrounded  nonEst     NA  NA        NA        NA
##  sh      unrounded  0.3986 0.0814 Inf     0.239    0.5580
## 
## Degrees-of-freedom method: asymptotic 
## Confidence level used: 0.95 
## 
## $`pairwise differences of context, vowel`
##  1                             estimate     SE  df z.ratio p.value
##  sV rounded - str rounded        -0.177 0.1233 Inf  -1.435  0.8412
##  sV rounded - stj rounded        -0.437 0.1028 Inf  -4.250  0.0006
##  sV rounded - sh rounded         -0.909 0.1087 Inf  -8.367  <.0001
##  sV rounded - sV unrounded        1.149 0.1090 Inf  10.538  <.0001
##  sV rounded - str unrounded       0.141 0.1027 Inf   1.370  0.8712
##  sV rounded - stj unrounded      nonEst     NA  NA      NA      NA
##  sV rounded - sh unrounded       -0.317 0.1090 Inf  -2.908  0.0709
##  str rounded - stj rounded       -0.260 0.1179 Inf  -2.207  0.3474
##  str rounded - sh rounded        -0.732 0.1230 Inf  -5.953  <.0001
##  str rounded - sV unrounded       1.325 0.1233 Inf  10.750  <.0001
##  str rounded - str unrounded      0.318 0.1177 Inf   2.697  0.1234
##  str rounded - stj unrounded     nonEst     NA  NA      NA      NA
##  str rounded - sh unrounded      -0.140 0.1233 Inf  -1.136  0.9489
##  stj rounded - sh rounded        -0.472 0.1025 Inf  -4.608  0.0001
##  stj rounded - sV unrounded       1.586 0.1028 Inf  15.420  <.0001
##  stj rounded - str unrounded      0.578 0.0961 Inf   6.011  <.0001
##  stj rounded - stj unrounded     nonEst     NA  NA      NA      NA
##  stj rounded - sh unrounded       0.120 0.1028 Inf   1.167  0.9412
##  sh rounded - sV unrounded        2.058 0.1087 Inf  18.937  <.0001
##  sh rounded - str unrounded       1.050 0.1023 Inf  10.259  <.0001
##  sh rounded - stj unrounded      nonEst     NA  NA      NA      NA
##  sh rounded - sh unrounded        0.592 0.1087 Inf   5.450  <.0001
##  sV unrounded - str unrounded    -1.008 0.1027 Inf  -9.816  <.0001
##  sV unrounded - stj unrounded    nonEst     NA  NA      NA      NA
##  sV unrounded - sh unrounded     -1.466 0.1090 Inf -13.446  <.0001
##  str unrounded - stj unrounded   nonEst     NA  NA      NA      NA
##  str unrounded - sh unrounded    -0.458 0.1027 Inf  -4.457  0.0002
##  stj unrounded - sh unrounded    nonEst     NA  NA      NA      NA
## 
## Degrees-of-freedom method: asymptotic 
## P value adjustment: tukey method for comparing a family of 8 estimates

Turning now to the dynamic lip protrusion trajectories gives some hints as to why Thielking (2019) found a statistically significant difference between /sV/ and /str/. Figure 2.4 shows that lip protrusion in /str/ contexts increases in the latter part of the central portion of the fricative, at roughly 70% of the duration, while /sV/ remains rather stable. The contexts /stj/ and /str/ behave similarly in that both show a continuous increase in lip protrusion; overall, however, /stj/ displays more protrusion than /str/ and is closer to /∫/. /∫/ has the largest amount of lip protrusion and reaches its maximum slightly after the midpoint of the fricative, at around 60%.
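
Because tokens differ in frame count (Table 2.1), a common way to obtain comparable trajectories is to resample each token onto a fixed 0–100% duration grid before averaging. A minimal numpy sketch (the grid size and linear interpolation are assumptions, not the exact smoothing used for the figures):

```python
import numpy as np

def resample_trajectory(values, n_points=11):
    """Linearly resample one token's protrusion trajectory onto a fixed
    0-100% of sibilant duration grid."""
    values = np.asarray(values, dtype=float)
    src = np.linspace(0.0, 100.0, len(values))
    grid = np.linspace(0.0, 100.0, n_points)
    return np.interp(grid, src, values)

def mean_trajectory(tokens, n_points=11):
    """Context-mean trajectory over a list of per-token frame sequences."""
    return np.mean([resample_trajectory(t, n_points) for t in tokens], axis=0)
```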


Figure 2.4: Dynamic lip protrusion across sibilant duration for various contexts

Running a GAMM analysis and plotting the results (Figure 2.5) confirms the patterns plotted in Figure 2.4. (However, it is unclear how to fit the GAMM correctly, since the number of data points varies considerably between trajectories and k has to be set to the number of data points, i.e. the number of frames, minus 1.)


Figure 2.5: Results of Generalised Additive Mixed Effects model

As can be seen in Figure 2.6, there are striking differences in the trajectories of lip protrusion between sibilants followed by rounded and unrounded vowels. As expected, rounded contexts exhibit significantly more lip protrusion than unrounded contexts. In particular, /sV/ shows major differences in lip protrusion dynamics: while lip protrusion remains rather stable in unrounded contexts, the trajectory in rounded contexts resembles that of /str/ and /stj/, showing an increase in protrusion up to 75% of the sibilant before it starts to taper off.


Figure 2.6: Dynamic lip protrusion across sibilant duration by following vowel

Looking at each speaker individually reveals that the speakers differ in terms of the dynamics of lip protrusion. Some speakers show nearly identical lip protrusion trajectories in /str/, /stj/ and /∫/ (M02), while others show a split between /∫, stj/ and /sV, str/ (F07, F08). There are also differences in protrusion dynamics: while some speakers show rather flat, i.e. unchanging, lip configurations, others show an increase in protrusion throughout the entire sibilant.


Figure 2.7: Dynamic lip protrusion across sibilant duration for various contexts and speakers

2.1.1 All sibilant contexts

2.1.1.1 Static


Figure 2.8: Static lip protrusion at sibilant midpoint for all contexts across all speakers

Figure 2.8 shows z-scored lip protrusion at sibilant midpoint across all contexts.

2.1.1.2 Dynamic


Figure 2.9: Dynamic lip protrusion across sibilant duration for all contexts across all speakers

Figure 2.9 shows lip protrusion in all contexts investigated. As expected, /∫r/ and /t∫/ pattern with /∫/ in showing the highest degree of lip protrusion. The other sibilant contexts /sp/, /sk/ and /st/ pattern with pre-vocalic /s/. While /spr/ and /skr/ show slightly more lip protrusion, it is still lower than in /str/ and /stj/.

Taking into account the following vowel, however, reveals an interesting pattern (see Figure 2.10). In unrounded vowel contexts, /skr/ and /spr/ are shifted in the direction of /∫/ and show similar protrusion to /str/. In rounded contexts this effect is smaller, as /skr/ and /spr/ pattern with their non-rhotic counterparts. Interestingly, /∫r/ shows a similar degree of protrusion in both contexts, while /∫V/ has less protrusion in the unrounded context. This is likely due to the influence of the rhotic on lip protrusion. Looking at tongue posture might provide some insight into the relationship between type of rhotic and lip protrusion: King and Ferragne (2020b) found that bunched /r/ shows more protrusion than retroflex/tip-up /r/, and previous studies on /s/-retraction suggest a similar relationship, with bunchers showing more retraction, likely due to greater lip protrusion. This raises the question of whether lip protrusion might be more important than lingual configuration.


Figure 2.10: Dynamic lip protrusion across sibilant duration for all contexts and following vowel across all speakers

2.2 Tongue data analysis

Midsagittal ultrasound tongue imaging was used to examine differences in tongue shape. In order to capture and quantify dynamic differences in tongue shape, Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA) were applied to the ultrasound images (see Smith et al. (2019), Faytak, Liu, and Sundara (2020) and Strycharczuk and Sebregts (2018) for similar approaches).

Processing of the ultrasound images followed Faytak, Liu, and Sundara (2020), using functions from the Python package SATKit (Faytak, Moisik, and Palo 2020). First, a series of filtering operations to reduce speckle noise in the ultrasound signal and improve the signal-to-noise ratio was applied to all relevant sibilant ultrasound frames (see Carignan (2014)). Figure 2.11 shows an unprocessed ultrasound frame and Figure 2.12 shows the same frame after filtering and resizing. The processed frames were then submitted to PCA for each speaker individually, taking the pixel values as input. The first 50 PC scores were retained, which capture at least 80% of the variance in each speaker’s UTI frames. “For each speaker, the scores for these PCs and the target of each token, i.e., /sV/, /str/, /stj/, /∫/, were submitted to LDA. The resulting linear discriminant (LD) score, when normalized to a [-1,1] range for all speakers, may be taken as an index of how distinctly /sV/- or /∫/-like the sibilant in each token is: The LDA was structured such that /∫/ was consistently near -1, and /sV/ was consistently near 1, for all speakers.” (Faytak, Liu, and Sundara 2020).
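
The PCA-then-LDA step can be sketched with scikit-learn in place of the SATKit internals. This is a simplified per-speaker pipeline under stated assumptions: `frames` is a hypothetical (n_frames, n_pixels) array of flattened, filtered UTI frames, and the min-max normalisation here does not fix the sign convention (orienting /∫/ near -1 and /sV/ near 1, as in Faytak, Liu, and Sundara (2020), would need an additional step).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def ld_scores(frames, labels, n_pcs=50):
    """PCA over pixel values, then LDA on the PC scores, for one speaker.

    `frames`: (n_frames, n_pixels) array of flattened UTI frames;
    `labels`: context label of each frame (e.g. sV, str, stj, sh).
    Returns the first LD score per frame, min-max normalised to [-1, 1];
    sign orientation is NOT fixed here.
    """
    n_pcs = min(n_pcs, frames.shape[0], frames.shape[1])
    scores = PCA(n_components=n_pcs).fit_transform(frames)
    ld = LinearDiscriminantAnalysis().fit_transform(scores, labels)[:, 0]
    return 2 * (ld - ld.min()) / (ld.max() - ld.min()) - 1
```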


Figure 2.11: Unprocessed UTI frame by speaker M06.


Figure 2.12: Filtered UTI frame by speaker M06.

Figure 2.13 shows the results of the LDA for the four contexts /sV/, /str/, /stj/ and /∫/. As can be seen, across all speakers, /str/ and /stj/ have distinct lingual configurations from both /sV/ and /∫/. /stj/ is however closer to /∫/ than /str/, reflecting the higher /∫/-likeness of /stj/ in lip rounding and acoustics. Both /str/ and /stj/ become more /∫/-like towards the end of the sibilant.


Figure 2.13: LD score trajectories across sibilant duration


Figure 2.14: LD score trajectories across sibilant duration by following vowel


Figure 2.15: LD score trajectories across sibilant duration by speaker


Figure 2.16: LD score of central portion of sibilants across all speakers


Figure 2.17: LD score of central portion of sibilants for all speakers

2.2.1 DeepLabCut: Automatic estimation of Ultrasound Tongue data

In a recent paper, Wrench and Balch-Tomes (2022) leverage the DeepLabCut suite to automatically track the tongue in ultrasound video and the lips in frontal-view lip videos. DeepLabCut (DLC) uses deep learning and transfer learning to “perform markerless estimation of speech articulator keypoints using only a few hundred hand-labelled images as training input”. Their models achieve keypoint estimation performance comparable to human annotators (see Wrench and Balch-Tomes (2022) for more details). This fully automatic method has great potential for the ultrasound data of the current study. As described in Turton (2017), applying PCA and LDA to image data can cause problems when interpreting the PC scores in terms of an articulatory tongue-to-PC-score mapping because of the messiness of the pixel data. Applying PCA and LDA to tongue splines might therefore provide a better mapping from PC score to tongue configuration, since spline data is significantly less messy (see Turton (2017)).

Running DLC on the current ultrasound image data yields promising results. Preliminary plotting of the automatically tracked tongue splines shows only minor problems for some speakers and/or tokens. Speakers F03, F08, F09 and M04 were excluded from the analysis due to insufficient quality of their ultrasound imaging; as can be seen, the model also had difficulties tracking the tongue for these speakers (see Figure 2.18). Overall, the results of the algorithm appear reasonable. Compared to the manually annotated tongue spline data in Thielking (2019), there seem to be only minor differences. In all speakers /∫/ shows more bunching than /sV/, and the relationships between the four contexts are comparable to the GAMM-smoothed splines reported in Thielking (2019).
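
Context-mean splines like those in Figure 2.19 can be sketched as a naive average over DLC keypoints. The data layout here is an assumption: `splines` is a hypothetical (n_tokens, n_keypoints, 2) array of tracked x/y coordinates at a comparable time point, with keypoints consistently ordered across tokens.

```python
import numpy as np

def mean_splines(splines, contexts):
    """Mean tracked tongue spline per context.

    Returns a dict mapping each context label to the (n_keypoints, 2)
    mean of the keypoint coordinates over that context's tokens.
    """
    splines = np.asarray(splines, dtype=float)
    contexts = np.asarray(contexts)
    return {c: splines[contexts == c].mean(axis=0) for c in np.unique(contexts)}
```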

PCA and LDA analysis on this data can therefore provide valuable insight into the relationship between lingual configuration and /s/-retraction, and help tease apart the influence of the lips and the tongue on this phenomenon. Such an analysis can also shed light on the usefulness of PCA on image data compared to automatically or manually annotated spline data, and on the interpretability of these analyses in terms of a mapping between articulatory configurations and their quantitative measures.


Figure 2.18: DLC-tracked splines


Figure 2.19: DLC-tracked mean splines for each context

References

Carignan, Christopher. 2014. “TRACTUS (Temporally Resolved Articulatory Configuration Tracking of UltraSound) Software Suite.” http://christophercarignan.github.io/TRACTUS/.
Faytak, Matthew, Suyuan Liu, and Megha Sundara. 2020. “Nasal Coda Neutralization in Shanghai Mandarin: Articulatory and Perceptual Evidence.” Laboratory Phonology 11 (1).
Faytak, Matthew, Scott R Moisik, and Pertti Palo. 2020. “The Speech Articulation Toolkit (SATKit): Ultrasound Image Analysis in Python.” In Proceedings of the 12th International Seminar on Speech Production (ISSP 2020). Online/New Haven, CT, 234–37.
King, Hannah, and Emmanuel Ferragne. 2020a. “Labiodentals /r/ Here to Stay: Deep Learning Shows Us Why.” Anglophonia 30 (1). http://journals.openedition.org/anglophonia/3424.
———. 2020b. “Loose Lips and Tongue Tips: The Central Role of the /r/-Typical Labial Gesture in Anglo-English.” Journal of Phonetics 80: 100978. https://doi.org/10.1016/j.wocn.2020.100978.
Lawson, Eleanor, Jane Stuart-Smith, and Lydia Rodger. 2019. “A Comparison of Acoustic and Articulatory Parameters for the GOOSE Vowel Across British Isles Englishes.” The Journal of the Acoustical Society of America 146 (6): 4363–81. https://doi.org/10.1121/1.5139215.
Smith, Bridget J, Jeff Mielke, Lyra Magloughlin, and Eric Wilbanks. 2019. “Sound Change and Coarticulatory Variability Involving English /ɹ/.” Glossa: A Journal of General Linguistics 4 (1).
Strycharczuk, Patrycja, and Koen Sebregts. 2018. “Articulatory Dynamics of (de)gemination in Dutch.” Journal of Phonetics 68: 138–49.
Thielking, Niklas. 2019. “A Study on /s/-Retraction in Glasgow.” Unpublished MSc Thesis, Glasgow.
Turton, Danielle. 2017. “Categorical or Gradient? An Ultrasound Investigation of /l/-Darkening and Vocalization in Varieties of English.” Laboratory Phonology 8 (1).
Wrench, Alan, and Jonathan Balch-Tomes. 2022. “Beyond the Edge: Markerless Pose Estimation of Speech Articulators from Ultrasound and Camera Images Using DeepLabCut.” Sensors 22 (3). https://doi.org/10.3390/s22031133.